Abstract

We consider learning algorithms induced by regularization methods in the regression
setting. We show that previously obtained error bounds for these algorithms, using a-priori
choices of the regularization parameter, can be attained using a suitable a-posteriori choice
based on validation. In particular, these results prove adaptation of the rate of convergence
of the estimators to the minimax rate induced by the "effective dimension" of the problem.
We also show universal consistency for this class of methods.
1. Introduction

We show that previous results in [2] about rates of convergence for regularization methods,
using a-priori choices of the regularization parameter, can be attained using a suitable
a-posteriori choice based on validation. We also show universal consistency for this class
of methods. The framework of semi-supervised statistical learning theory is the same one
considered in [2]. The algorithms we consider are based on the formalism of regularization
methods for linear ill-posed inverse problems in their classical setting (see for example [11]
for a general reference). Some popular algorithms from this class are: regularized
least-squares, truncated SVD, the Landweber method and the $\nu$-method.

The paper is organized as follows. In Section 2 we focus on a-priori choices of the
regularization parameter for regularization methods. Theorem 1 shows universal consistency
for a large class of choice rules, and Theorem 2 shows specific rates of convergence
under suitable prior assumptions (parameterized by the constants $r$, $s$, $C_r$ and $D_s$) on the
unknown probability measure $\rho$. Unlabelled data are added to the training set in order to
improve the rates for a certain range of the parameters $r$ and $s$.

In Section 3 we consider a validation technique for the a-posteriori choice of the
regularization parameter. Theorem 3 shows how error bounds for the estimators $f_{\tilde{\mathbf z},\lambda}$, with
a-priori choices of $\lambda$, can be transferred to the estimators $f_{\mathbf z^{tot}}$, which use the validation
examples $\mathbf z^v$ in $\mathbf z^{tot} = (\tilde{\mathbf z}, \mathbf z^v)$ to determine $\lambda$. The subsequent corollaries are applications
of Theorem 3 to the choices of $\lambda$ described in Section 2.

In  Sections  4  and  5  we  give  the  proofs  of  the  results  stated  in  the  previous  Sections, 
using  some  lemmas  from  [2]. 

2.  A-priori  choice  of  the  regularization  parameter. 

We consider the setting of semi-supervised statistical learning. We assume that $Y \subset [-M, M]$
and we let the supervised part of the training set be equal to

$\mathbf z = (z_1, \dots, z_m)$,

with $z_i = (x_i, y_i)$ drawn i.i.d. according to the probability measure $\rho$ over $Z = X \times Y$.
Moreover we assume that the unsupervised part of the training set is $(x^u_{m+1}, \dots, x^u_{\tilde m})$, with
$x^u_i$ drawn i.i.d. according to the marginal probability measure over $X$, $\rho_X$. For the sake of
brevity we also introduce the complete training set

$\tilde{\mathbf z} = (\tilde z_1, \dots, \tilde z_{\tilde m})$,

with $\tilde z_i = (\tilde x_i, \tilde y_i)$, where we introduced the compact notations $\tilde x_i$ and $\tilde y_i$, defined by

$\tilde x_i = x_i$ if $1 \le i \le m$, $\quad \tilde x_i = x^u_i$ if $m < i \le \tilde m$,

$\tilde y_i = \frac{\tilde m}{m}\, y_i$ if $1 \le i \le m$, $\quad \tilde y_i = 0$ if $m < i \le \tilde m$.

It is clear that, in the supervised setting, the unsupervised part of the training set
is missing, whence $\tilde m = m$ and $\tilde{\mathbf z} = \mathbf z$.

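In the following we will study the generalization properties of a class of estimators $f_{\tilde{\mathbf z},\lambda}$
belonging to the hypothesis space $\mathcal H$: the RKHS of functions on $X$ induced by the bounded
Mercer kernel $K$ (in the following $\kappa = \sup_{x \in X} K(x,x)$). The learning algorithms that we
consider have the general form

(1)  $f_{\tilde{\mathbf z},\lambda} = G_\lambda(T_{\tilde{\mathbf x}})\, g_{\mathbf z}$,

where $T_{\tilde{\mathbf x}} \in \mathcal L(\mathcal H)$ is given by

$T_{\tilde{\mathbf x}} f = \frac{1}{\tilde m} \sum_{i=1}^{\tilde m} K_{\tilde x_i} \langle K_{\tilde x_i}, f \rangle_{\mathcal H}$,

$g_{\mathbf z} \in \mathcal H$ is given by

$g_{\mathbf z} = \frac{1}{\tilde m} \sum_{i=1}^{\tilde m} K_{\tilde x_i} \tilde y_i = \frac{1}{m} \sum_{i=1}^{m} K_{x_i} y_i$,

and the regularization parameter $\lambda$ lies in the range $(0, \kappa]$. We will often use the shortcut
notation $\dot\lambda = \lambda / \kappa$.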
The functions $G_\lambda : [0, \kappa] \to \mathbb R$, which select the regularization method, will be
characterized in terms of the constants $A$ and $B_r$ in $[0, +\infty]$, defined as follows

(2)  $A = \sup_{\lambda \in (0,\kappa]} \; \sup_{\sigma \in [0,\kappa]} |(\sigma + \lambda)\, G_\lambda(\sigma)|$,

(3)  $B_r = \sup_{t \in [0,r]} \; \sup_{\lambda \in (0,\kappa]} \; \sup_{\sigma \in [0,\kappa]} |1 - G_\lambda(\sigma)\sigma|\, \sigma^t \lambda^{-t}, \qquad r > 0$.

Finiteness of $A$ and $B_r$ (with $r$ over a suitable range) is standard in the literature
of ill-posed inverse problems (see [11] for reference). Regularization methods have been
recently studied in the context of learning theory in [12, 8, 7, 9, 1].
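
As an illustration of how the functions $G_\lambda$ select a method, the following is a minimal Python sketch of three filters from the class mentioned in the Introduction, together with a grid estimate of the constant $A$ of eq. (2). The function names, the grid resolution `n`, and the Landweber parametrization (step size $1/\kappa$ with $\lfloor \kappa/\lambda \rfloor$ iterations) are assumptions made for this example and are not taken from the paper.

```python
import math

def tikhonov(lam, sigma, kappa=1.0):
    """Regularized least-squares filter: G_lam(sigma) = 1 / (sigma + lam)."""
    return 1.0 / (sigma + lam)

def spectral_cutoff(lam, sigma, kappa=1.0):
    """Truncated SVD filter: invert sigma only when sigma >= lam."""
    return 1.0 / sigma if sigma >= lam else 0.0

def landweber(lam, sigma, kappa=1.0):
    """Landweber filter; assumes step 1/kappa and t = floor(kappa/lam) iterations."""
    t = max(1, math.floor(kappa / lam))
    return sum((1.0 - sigma / kappa) ** j for j in range(t)) / kappa

def estimate_A(G, kappa=1.0, n=40):
    """Grid approximation of sup over lam in (0,kappa], sigma in [0,kappa] of |(sigma+lam) G_lam(sigma)|."""
    lams = [kappa * (i + 1) / n for i in range(n)]
    sigmas = [kappa * j / n for j in range(n + 1)]
    return max(abs((s + l) * G(l, s, kappa)) for l in lams for s in sigmas)

for name, G in [("tikhonov", tikhonov), ("cutoff", spectral_cutoff), ("landweber", landweber)]:
    print(name, estimate_A(G))
```

On such a grid the estimate is 1 for regularized least-squares and does not exceed 2 for the other two filters, which is consistent with $A$ being finite for these methods.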

The main results of the paper, Theorems 1 and 2, describe the convergence rates of
$f_{\tilde{\mathbf z},\lambda}$ to the target function $f_{\mathcal H}$. Here, the target function is the "best" function which can
be arbitrarily well approximated by elements of our hypothesis space $\mathcal H$. More formally,
$f_{\mathcal H}$ is the projection of the regression function $f_\rho(x) = \int_Y y \, d\rho_{|x}(y)$ onto the closure of $\mathcal H$
in $L^2(X, \rho_X)$.

The convergence rates in Theorem 2 will be described in terms of the constants $C_r$ and
$D_s$ in $[0, +\infty]$ characterizing the probability measure $\rho$. These constants can be described
in terms of the integral operator $L_K : L^2(X, \rho_X) \to L^2(X, \rho_X)$ of kernel $K$. Note that the
same integral operator is denoted by $T$ when seen as a bounded operator from $\mathcal H$ to $\mathcal H$.

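The constants $C_r$ characterize the conditional distributions $\rho_{|x}$ through $f_{\mathcal H}$; they are
defined as follows

(4)  $C_r = \kappa^r \|L_K^{-r} f_{\mathcal H}\|_\rho$ if $f_{\mathcal H} \in \operatorname{Im} L_K^r$, $\qquad C_r = +\infty$ if $f_{\mathcal H} \notin \operatorname{Im} L_K^r$, $\qquad r > 0$.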

Finiteness of $C_r$ is a common source condition in the inverse problems literature (see
[11] for reference). This type of condition has been introduced in the statistical learning
literature in [6, 16, 3, 15, 4].

The constants $D_s$ characterize the marginal distribution $\rho_X$ through the effective
dimension $\mathcal N(\lambda) = \operatorname{Tr}\left[T (T + \lambda)^{-1}\right]$; they are defined as follows

(5)  $D_s = 1 \vee \sup_{\lambda \in (0,1]} \sqrt{\mathcal N(\lambda)\, \lambda^s}, \qquad s \in (0,1]$.

Finiteness of $D_s$ was implicitly assumed in [3, 4].
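
To make the effective dimension concrete, the sketch below evaluates $\mathcal N(\lambda) = \sum_i \sigma_i / (\sigma_i + \lambda)$ from a list of eigenvalues of $T$ and approximates the supremum in eq. (5) on a finite grid. The polynomially decaying spectrum $\sigma_i = i^{-2}$, the grid of values of $\lambda$, and the function names are assumptions chosen purely for illustration.

```python
import math

def effective_dimension(eigenvalues, lam):
    """N(lam) = Tr[T (T + lam)^{-1}] computed from the eigenvalues of T."""
    return sum(s / (s + lam) for s in eigenvalues)

def estimate_D_s(eigenvalues, s, lams):
    """Grid approximation of D_s = 1 v sup_lam sqrt(N(lam) * lam^s)."""
    sup = max(math.sqrt(effective_dimension(eigenvalues, lam) * lam ** s) for lam in lams)
    return max(1.0, sup)

# Hypothetical spectrum with polynomial decay and a logarithmic grid in (0, 1].
eigs = [i ** -2.0 for i in range(1, 2001)]
lams = [10.0 ** (-k / 4) for k in range(25)]

print(effective_dimension(eigs, 0.01))
print(estimate_D_s(eigs, 0.5, lams))
```

For this spectrum $\mathcal N(\lambda)$ grows like $\lambda^{-1/2}$ as $\lambda \to 0$, so the grid estimate of $D_s$ with $s = 1/2$ stays bounded, while a smaller $s$ would make it grow as the grid is refined.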

The next theorem shows (strong) universal consistency (in probability) for the estimators
$f_{\tilde{\mathbf z},\lambda}$ under mild assumptions on the choice of $\lambda$. The function $|x|_+$, appearing in the
text of Theorem 1, is the "positive part" of $x$, that is $\frac{x + |x|}{2}$.

Theorem 1. Let $\{\tilde{\mathbf z}_m\}_{m=1}^{\infty}$ be a sequence of training sets composed of $m$ labelled examples
drawn i.i.d. from a probability measure $\rho$ over $Z$, and $\tilde m_m - m \ge 0$ unlabelled examples
drawn i.i.d. from the marginal measure of $\rho$ over $X$. Let the regularization parameter
choice, $\lambda_m : \mathbb N \to (0, \kappa]$, fulfill the conditions

(6)  $\lim_{m \to \infty} \lambda_m = 0$,

(7)  $\lim_{m \to \infty} \sqrt{m}\, \lambda_m = \infty$.

Then, if $B_{\bar r} < +\infty$ for some $\bar r > 0$, it holds¹

$\lim_{m \to \infty} \|f_{\tilde{\mathbf z}_m, \lambda_m} - f_{\mathcal H}\|_\rho \overset{P}{=} 0$.

Theorem  2  below  is  a  restatement  in  a  slightly  modified  form  of  Theorem  2  in  [2].  In 
particular  the  introduction  of  the  parameter  q  >  1  will  be  useful  when  we  will  merge  this 
result  with  Theorem  3  in  the  proof  of  Corollary  2. 


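Theorem 2. Let $r > 0$, $s \in (0, 1]$ and $\alpha \in [0, |2 - 2r - s|_+]$. Furthermore, let $m$ and $\lambda$
satisfy the constraints $\lambda \le \|T\|$ and

(8)  $\dot\lambda = q \left( \frac{4 D_s \log\frac{6}{\delta}}{\sqrt m} \right)^{\frac{2}{2r+s+t_1}}$

for some $q \ge 1$, $\delta \in (0, 1/3)$ and $t_1$ defined in eq. (10). Finally, assume $\tilde m \ge 4 \vee m \dot\lambda^{-\alpha}$.
Then, with probability greater than $1 - 3\delta$, it holds

$\|f_{\tilde{\mathbf z},\lambda} - f_{\mathcal H}\|_\rho \le q^r E_r \left( \frac{4 D_s \log\frac{6}{\delta}}{\sqrt m} \right)^{\frac{2r - t_2}{2r+s+t_1}}$,

where

(9)  $E_r = C_r (30A + 2(3 + r) B_r + 1) + 9MA$,

(10)  $t_1 = |2 - 2r - s|_+ - \alpha$,

(11)  $t_2 = |1 - 2r - 2s - t_1|_+$.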


The proofs of the above Theorems are postponed to Section 4.
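
The arithmetic behind Theorem 2 can be traced with a few lines of Python. The sketch below is only a reading aid under stated assumptions: it takes the parameter choice (8) to be $\dot\lambda = q\,(4 D_s \log(6/\delta)/\sqrt m)^{2/(2r+s+t_1)}$ and the error bound to scale with the exponent $(2r - t_2)/(2r+s+t_1)$, with $t_1$ and $t_2$ as in (10) and (11). The function and key names are hypothetical.

```python
import math

def positive_part(x):
    """|x|_+ = (x + |x|) / 2."""
    return max(x, 0.0)

def theorem2_quantities(r, s, alpha, m, delta, D_s, q=1.0):
    """Exponents, parameter choice and unlabelled-data requirement (assumed forms)."""
    if not 0.0 <= alpha <= positive_part(2 - 2 * r - s):
        raise ValueError("alpha must lie in [0, |2 - 2r - s|_+]")
    t1 = positive_part(2 - 2 * r - s) - alpha
    t2 = positive_part(1 - 2 * r - 2 * s - t1)
    base = 4.0 * D_s * math.log(6.0 / delta) / math.sqrt(m)
    lam_dot = q * base ** (2.0 / (2 * r + s + t1))
    return {
        "t1": t1,
        "t2": t2,
        "lam_dot": lam_dot,                                   # eq. (8), assumed form
        "rate_exponent": (2 * r - t2) / (2 * r + s + t1),     # exponent of the bound
        "min_total_examples": max(4.0, m * lam_dot ** (-alpha)),
    }

print(theorem2_quantities(r=0.5, s=0.5, alpha=0.5, m=10**6, delta=0.1, D_s=1.0))
print(theorem2_quantities(r=0.5, s=0.5, alpha=0.0, m=10**6, delta=0.1, D_s=1.0))
```

With $r = s = 1/2$, taking $\alpha = 1/2$ gives $t_1 = t_2 = 0$ and the exponent $2r/(2r+s) = 2/3$ at the price of additional unlabelled examples, whereas $\alpha = 0$ needs no unlabelled data but lowers the exponent to $1/2$.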


3.  Adaptation. 

In  this  section  we  show  the  adaptation  properties  of  the  estimators  obtained  by  a 
suitable  data-dependent  choice  of  the  regularization  parameter.  The  main  results  of  this 
section  are  obtained  assuming  that 

(12)  $f_{\mathcal H} = f_\rho$;

this is true for every $\rho$ when the underlying kernel $K$ is universal (see [17]). In fact for
this class of kernels the RKHS $\mathcal H$ is always dense in $L^2(X, \rho_X)$. The Gaussian kernel is a
popular instance of a kernel in this family.

Let the validation set

$\mathbf z^v = (z^v_1, \dots, z^v_{m^v})$

be composed of $m^v$ labelled examples $z^v_i = (x^v_i, y^v_i)$ drawn i.i.d. from the probability
measure $\rho$ over $Z = X \times Y$. The validation set $\mathbf z^v$ is, by assumption, independent of the
training set $\tilde{\mathbf z}$, and these two sets define the learning set

$\mathbf z^{tot} = (\tilde{\mathbf z}, \mathbf z^v)$,

which represents the total input of the adaptive learning algorithm. Following the
notations of the previous Section, we let $\tilde m$ be the total number of examples in $\tilde{\mathbf z}$, and $m$ the
number of its labelled examples.


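¹ We say that the sequence of random variables $\{X_m\}_{m \in \mathbb N}$ converges in probability to
the random variable $X$ (and we write $\lim_{m\to\infty} X_m \overset{P}{=} X$ or $X_m \overset{P}{\to} X$) if for every $\epsilon > 0$,
$\lim_{m\to\infty} P[|X_m - X| \ge \epsilon] = 0$. This is equivalent to saying that, for every $\delta \in (0, 1)$,
$P[|X_m - X| \ge \epsilon(m, \delta)] \le \delta$, with $\lim_{m\to\infty} \epsilon(m, \delta) = 0$.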
Now let us explain how $\mathbf z^v$ is used for the choice of $\lambda$. We consider the finite set of
positive reals $\Lambda_m$, depending on $m$, the number of labelled examples in $\tilde{\mathbf z}$, and the
data-dependent choice for the regularization parameter is

(13)  $\hat\lambda_{\mathbf z^v} = \operatorname{argmin}_{\lambda \in \Lambda_m} \frac{1}{m^v} \sum_{i=1}^{m^v} \left( T_M f_{\tilde{\mathbf z},\lambda}(x^v_i) - y^v_i \right)^2$,

where the truncation operator $T_M : L^2(X, \rho_X) \to L^2(X, \rho_X)$ is defined by

$T_M f(x) = (|f(x)| \wedge M) \operatorname{sign} f(x)$.

The final learning estimator, whose adaptation properties are investigated in this Section,
is defined as follows

(14)  $f_{\mathbf z^{tot}} = T_M f_{\tilde{\mathbf z}, \hat\lambda_{\mathbf z^v}}$.
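
The selection rule (13) and the final estimator (14) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the candidate estimators are passed in as a hypothetical dictionary mapping each $\lambda \in \Lambda_m$ to an already trained function, and the toy shrinkage estimators in the usage lines are invented for the example.

```python
import math

def truncate(value, M):
    """Truncation operator T_M: (|value| ^ M) * sign(value)."""
    return math.copysign(min(abs(value), M), value)

def select_by_validation(estimators, validation, M):
    """Pick lambda minimizing the empirical validation error of the truncated estimator.

    estimators: dict mapping lambda -> callable f(x) (assumed already trained)
    validation: list of (x, y) pairs independent of the training set
    Returns the selected lambda and the truncated final estimator.
    """
    def validation_error(f):
        return sum((truncate(f(x), M) - y) ** 2 for x, y in validation) / len(validation)

    best = min(estimators, key=lambda lam: validation_error(estimators[lam]))
    chosen = estimators[best]
    return best, (lambda x: truncate(chosen(x), M))

# Toy usage: shrinkage estimators x / (1 + lambda) and noiseless validation data.
toy = {lam: (lambda x, lam=lam: x / (1.0 + lam)) for lam in (1.0, 0.5, 0.25, 0.1)}
val = [(x, x / 1.1) for x in (-0.8, -0.3, 0.2, 0.9)]
lam_hat, f_tot = select_by_validation(toy, val, M=1.0)
print(lam_hat, f_tot(5.0))
```

Truncating at level $M$ costs nothing when $Y \subset [-M, M]$, and it keeps every candidate bounded, which is what allows a uniform control of the validation errors over the finite set $\Lambda_m$.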

Theorem 3 below is the main result of this Section and shows an important property
of the estimator $f_{\mathbf z^{tot}}$. It will be used to extend to $f_{\mathbf z^{tot}}$ convergence results similar to the
ones obtained in the previous Section.


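Theorem 3. Let $\rho$, $K$, $m$, $\tilde m$, $m^v$, $\Lambda_m$, $\delta \in (0, 1)$, $\epsilon > 0$ and $\lambda_m \in \Lambda_m$ be such that with
probability greater than $1 - \delta$, it holds

$\|f_{\tilde{\mathbf z},\lambda_m} - f_\rho\|_\rho \le \epsilon$.

Then, with probability greater than $1 - 2\delta$, it holds

$\|f_{\mathbf z^{tot}} - f_\rho\|_\rho \le \hat\epsilon$,

with

$\hat\epsilon^2 = 2\epsilon^2 + \frac{80 M^2}{m^v} \log\frac{2 |\Lambda_m|}{\delta}$.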
The proof of Theorem 3 is postponed to Section 5.

The first corollary of Theorem 3 proves universal consistency for the estimators $f_{\mathbf z^{tot}}$
under mild assumptions on the cardinalities of the grids $\Lambda_m$ and validation sets $\mathbf z^v$.


Corollary 1. Let $K$ be a universal kernel, $Q$ be a constant greater than 1, and define

(15)  $\Lambda_m = \{\kappa, \kappa Q^{-1}, \dots, \kappa Q^{-|\Lambda_m| + 1}\}$,

with

(16)  $|\Lambda_m| = \omega(1)$.

Moreover let $\{\mathbf z^{tot}_m\}_{m=1}^{\infty}$ be a sequence of learning sets drawn according to a probability
measure $\rho$ over $Z$. Assume $\mathbf z^{tot}_m = (\tilde{\mathbf z}_m, \mathbf z^v_m)$, with the training sets $\tilde{\mathbf z}_m$ composed of $m$
labelled examples and $\tilde m_m - m \ge 0$ unlabelled examples, and $\mathbf z^v_m$ the validation sets composed
of $m^v_m = \omega(\log |\Lambda_m|)$ examples. Then, if $B_{\bar r} < +\infty$ for some $\bar r > 0$, it holds

$\lim_{m \to \infty} \|f_{\mathbf z^{tot}_m} - f_\rho\|_\rho \overset{P}{=} 0$.


Proof. The result is a corollary of Theorems 1 and 3. The universality of $K$ enforces
the equality (12) (see [17]). Condition (16) implies that the regularization parameter
$\lambda_m = \kappa Q^{-(\lfloor \log\log m \rfloor \wedge (|\Lambda_m| - 1))}$, which belongs to $\Lambda_m$, fulfills the assumptions (6) and (7).
Hence, using the assumption on $m^v_m$, we get that for every $\delta \in (0, 1)$, with probability
greater than $1 - 2\delta$

$\|f_{\mathbf z^{tot}_m} - f_\rho\|^2_\rho \le o(1) + \frac{80 M^2}{\omega(\log |\Lambda_m|)} \log\frac{2 |\Lambda_m|}{\delta} \to 0$.

□
The second corollary proves explicit rates for the convergence of $f_{\mathbf z^{tot}}$ to $f_\rho$ over specific
prior classes defined in terms of finiteness of the constants $C_r$ and $D_s$. The main assumption
is the requirement $m^v \ge m / \log m$. Since this constraint can be fulfilled with $m^v$ still
asymptotically negligible with respect to $m$, the rates (expressed in terms of $m$) that are
obtained in the second part of the corollary are minimax optimal over the corresponding
priors (see [4]).


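Corollary 2. Let $K$ be a universal kernel. Consider a learning set $\mathbf z^{tot}$ with
$m^v \ge \frac{m}{\log m}$ and $\tilde m \ge 4 \vee m^{1+\eta}$, for some constants $\eta \ge 0$, $r > 0$, and $s \in (0, 1]$. Define
$\Lambda_m$ as in eq. (15) with $Q$ an arbitrary constant greater than 1 and

(17)  $\frac{\eta}{\alpha} \log_Q m + 1 \le |\Lambda_m| \le m$,

with $\alpha$ defined by eq. (20).

Moreover assume that for some $\delta \in (0, 1/6)$, $m$ is large enough that it holds

(18)  $Q \left( \frac{4 D_s \log\frac{6}{\delta}}{\sqrt m} \right)^{\frac{2\eta}{\alpha}} \le \kappa^{-1} \|T\|$.

Then, with probability greater than $1 - 6\delta$

(19)  $\|f_{\mathbf z^{tot}} - f_\rho\|_\rho \le 4 (Q^r E_r D_s + 3M) \log(6m/\delta)\; m^{-\frac{1}{2} \frac{2r - t_2}{2r+s+t_1}}$,

where $E_r$, $t_1$ and $t_2$ are the constants defined in equations (9), (10) and (11), substituting

(20)  $\alpha = |2 - 2r - s|_+ \wedge \frac{\eta}{1+\eta} (2r + s + |2 - 2r - s|_+)$.

In particular, if $r + s \ge \frac{1}{2}$ and $\eta = \frac{|2 - 2r - s|_+}{2r+s}$, and assuming

(21)  $2 \log_Q m + 1 \le |\Lambda_m| \le m$,

and

$Q \left( \frac{4 D_s \log\frac{6}{\delta}}{\sqrt m} \right)^{\frac{2}{2r+s}} \le \kappa^{-1} \|T\|$,

with probability greater than $1 - 6\delta$, it holds

$\|f_{\mathbf z^{tot}} - f_\rho\|_\rho \le 4 (Q^r E_r D_s + 3M) \log(6m/\delta)\; m^{-\frac{1}{2} \frac{2r}{2r+s}}$.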


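Proof. The result is a corollary of Theorems 2 and 3. The universality of $K$ enforces the
equality (12) (see [17]).

First, from equations (20) and (10), by simple algebra we get

$\frac{\eta}{\alpha} = \frac{1}{2r + s + t_1}$.

Therefore condition (18) is equivalent to

$\dot\lambda_q = q \left( \frac{4 D_s \log\frac{6}{\delta}}{\sqrt m} \right)^{\frac{2}{2r+s+t_1}} \le \kappa^{-1} \|T\| \qquad \forall q \in [1, Q]$,

and the condition $\dot\lambda \le \kappa^{-1} \|T\|$ in the text of Theorem 2 is verified by $\dot\lambda_q$ for every $q \in [1, Q]$.
Moreover, since $D_s \ge 1$ and $\delta \le 1/6$, for every $q \in [1, Q]$ we can write

$\tilde m \ge 4 \vee m^{1+\eta} = 4 \vee m \left( m^{-\frac{\eta}{\alpha}} \right)^{-\alpha} \ge 4 \vee m \dot\lambda_q^{-\alpha}$,

which shows that also the other assumption of Theorem 2 is verified.

Hence, by Theorem 2 we get that for every $q \in [1, Q]$, with probability greater than
$1 - 3\delta$, it holds

$\|f_{\tilde{\mathbf z},\lambda_q} - f_\rho\|_\rho \le \epsilon = Q^r E_r \left( \frac{4 D_s \log\frac{6}{\delta}}{\sqrt m} \right)^{\frac{2r - t_2}{2r+s+t_1}}$.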
The next step is verifying that for some $\bar q \in [1, Q]$, $\lambda_{\bar q} = \kappa \dot\lambda_{\bar q} \in \Lambda_m$, and hence applying
Theorem 3.

In fact, from definition (15), assumption (17) and Proposition 5, it is clear that

$\min \Lambda_m \le \kappa m^{-\frac{\eta}{\alpha}} \le \lambda_1 \le \lambda_{\bar q} \le \lambda_Q \le \|T\| \le \kappa = \max \Lambda_m$,

for some $\bar q$.
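
The step above rests on a simple covering property of the geometric grid (15): any target value between $\min \Lambda_m$ and $\kappa$ has a grid point within a factor $Q$ above it. The following Python sketch checks this property numerically; the function names and the specific values of $\kappa$, $Q$ and the grid size are assumptions made for the example.

```python
def geometric_grid(kappa, Q, size):
    """Grid {kappa, kappa/Q, ..., kappa * Q^(-size+1)} as in eq. (15)."""
    return [kappa * Q ** (-j) for j in range(size)]

def grid_point_within_factor(grid, Q, target):
    """Smallest grid value lam with target <= lam <= Q * target, or None if absent."""
    candidates = [lam for lam in grid if target <= lam <= Q * target]
    return min(candidates) if candidates else None

grid = geometric_grid(kappa=1.0, Q=2.0, size=10)
print(grid_point_within_factor(grid, 2.0, 0.03))  # a grid value in [0.03, 0.06]
```

In the proof the target plays the role of $\lambda_1$, and the grid point found is $\lambda_{\bar q} = \bar q \lambda_1$ with $\bar q \in [1, Q]$, so the a-priori bound of Theorem 2 is available at a value of $\lambda$ that the validation step can actually select.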

Applying Theorem 3, we get that with probability greater than $1 - 6\delta$ it holds

$\|f_{\mathbf z^{tot}} - f_\rho\|_\rho \le \hat\epsilon$,

with, using again condition (17) and the assumption $m^v \ge \frac{m}{\log m}$, the chain of inequalities

< 

< 

< 


80M2  6  |Am|  \ 

^l0g—  ) 


2 


2  80 M1  2  6m 

e  H - log  — 

m  o 


12  M  ,  6m 
<  e  +  —j=  log  — 
Jm  o 


4QrErDs  log(6/5)  m  2  2r+^+ti  +  12A/log(6m/5)  m 
4(Qr ErDs  +  3 M)  log(6m/5)  m~  '  2-+‘+*i  , 


1 

2 


which  concludes  the  proof  of  the  first  part  of  the  Corollary. 

The second part of the Corollary is an instantiation of the previous result. In fact by
equations (20) and (10), the assumption $\eta = \frac{|2 - 2r - s|_+}{2r+s}$ implies $\alpha = |2 - 2r - s|_+$ and
$t_1 = 0$. Moreover from the assumption $r + s \ge \frac{1}{2}$ and eq. (11) we get $t_2 = 0$, and noticing
that $\frac{\eta}{\alpha} = \frac{1}{2r+s} \le 2$, it is clear that condition (21) implies condition (17).

□ 
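
The algebra used in the proof of Corollary 2 can be checked numerically. The sketch below assumes that eq. (20) reads $\alpha = |2-2r-s|_+ \wedge \frac{\eta}{1+\eta}(2r+s+|2-2r-s|_+)$, which is the form for which the identity $\eta/\alpha = 1/(2r+s+t_1)$ holds whenever the second term attains the minimum; the function names are hypothetical.

```python
def positive_part(x):
    """|x|_+ = (x + |x|) / 2."""
    return max(x, 0.0)

def alpha_from_eta(r, s, eta):
    """Assumed form of eq. (20)."""
    p = positive_part(2 - 2 * r - s)
    return min(p, eta / (1.0 + eta) * (2 * r + s + p))

def corollary2_exponents(r, s, eta):
    """Return (alpha, t1, t2) following eqs. (20), (10) and (11)."""
    alpha = alpha_from_eta(r, s, eta)
    t1 = positive_part(2 - 2 * r - s) - alpha
    t2 = positive_part(1 - 2 * r - 2 * s - t1)
    return alpha, t1, t2

r, s = 0.25, 0.5
# Second part of the corollary: eta = |2 - 2r - s|_+ / (2r + s) gives t1 = t2 = 0.
eta_opt = positive_part(2 - 2 * r - s) / (2 * r + s)
print(corollary2_exponents(r, s, eta_opt))
# A smaller eta (fewer unlabelled examples) leaves t1 > 0.
print(corollary2_exponents(r, s, 0.5))
```

For $r = 1/4$, $s = 1/2$ the first call returns $\alpha = 1$ and $t_1 = t_2 = 0$, so the rate exponent becomes $2r/(2r+s)$, while $\eta = 1/2$ yields $\alpha = 2/3$, $t_1 = 1/3$ and $\eta/\alpha = 3/4 = 1/(2r+s+t_1)$.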


4.  Proofs  of  Theorems  1  and  2 

In this section we give the proofs of Theorems 1 and 2. We use various propositions
taken from [2], which we state without proof.

4.1. Before proving Theorem 1, we begin by showing some preliminary propositions. The
first one is a technical result about sequences of real numbers.

Proposition 1. Let $\{a_i\}_{i \in \mathbb N}$ and $\{b_i\}_{i \in \mathbb N}$ be two non-increasing sequences of reals in the
interval $(0, 1)$ with

$\lim_{i \to \infty} a_i = 0$,

$\lim_{i \to \infty} b_i = 0$.

Then there exists a sequence $\{c_i\}_{i \in \mathbb N}$ of reals in the interval $(0, 1)$ such that, defining
$d_i = \log c_i / \log b_i$, the following properties hold,

i) $\{d_i\}_{i \in \mathbb N}$ is a non-increasing sequence of positive reals,
ii) $\{c_i\}_{i \in \mathbb N}$ is a non-increasing sequence of positive reals, with

$c_i \ge a_i \quad \forall i \in \mathbb N$,

$\lim_{i \to \infty} c_i = 0$.

Proof. We consider the sequence $\{c_i\}_{i \in \mathbb N}$ of positive numbers constructed by the recursive
rule

$c_1 = a_1, \qquad c_{i+1} = a_{i+1} \vee (b_{i+1})^{\frac{\log c_i}{\log b_i}}$.
Let  us  prove  point  i)  by  induction. 
Since by assumption $a_1$ and $b_1$ belong to $(0, 1)$, by construction $d_1 = \frac{\log a_1}{\log b_1} > 0$.

Now, for $i \ge 1$ assume $d_i > 0$; then by construction, either $c_{i+1} = (b_{i+1})^{d_i}$, and hence
$d_{i+1} = \frac{\log c_{i+1}}{\log b_{i+1}} = d_i > 0$, or $c_{i+1} = (b_{i+1})^{d_{i+1}} = a_{i+1} > (b_{i+1})^{d_i}$, and hence, since $a_{i+1}$
and $b_{i+1}$ belong to $(0, 1)$, it holds

$d_{i+1} = \frac{\log a_{i+1}}{\log b_{i+1}} > 0$,

$(b_{i+1})^{d_{i+1}} > (b_{i+1})^{d_i} \;\Rightarrow\; d_{i+1} < d_i$.

Let us now prove point ii).

First, by construction $c_i \ge a_i > 0$. Moreover, again by construction, either $c_{i+1} = a_{i+1}$,
and hence,

$c_{i+1} = a_{i+1} \le a_i \le c_i$,

or $c_{i+1} = (b_{i+1})^{d_i}$ and hence, since $d_i > 0$ by point i), it holds

$c_{i+1} = (b_{i+1})^{d_i} \le (b_i)^{d_i} = c_i$.

Therefore the sequence $\{c_i\}_{i \in \mathbb N}$ is non-increasing and $c_i \le c_1 = a_1 < 1$.

Finally, we prove that $\lim_i c_i = 0$.

Let us assume that there exists an infinite increasing sequence of naturals $\{i(k)\}_{k \in \mathbb N}$,
such that

$c_{i(k)} = a_{i(k)} \quad \forall k \in \mathbb N$.

Since, by assumption, $\lim_i a_i = 0$, then $\lim_k c_{i(k)} = 0$. Therefore, since we already
proved that $\{c_i\}_{i \in \mathbb N}$ is non-increasing, $\lim_i c_i = 0$. This proves the Proposition, if
$\{i(k)\}_{k \in \mathbb N}$ exists.

If $\{i(k)\}_{k \in \mathbb N}$ does not exist, by construction, there exists $I \in \mathbb N$ such that

$c_{i+1} = (b_{i+1})^{d_i} \quad \forall i \ge I$.

Therefore, recalling the definition of $d_i$, by induction, it follows

$c_i = (b_i)^{d_I} \quad \forall i \ge I$.

Recalling that $d_I > 0$ and $\lim_i b_i = 0$, the relation above proves that, also in this case,
$\lim_i c_i = 0$.

□ 
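
The construction in the proof of Proposition 1 is easy to run on finite prefixes of the sequences. The Python sketch below assumes the recursion starts from $c_1 = a_1$ and continues with $c_{i+1} = a_{i+1} \vee (b_{i+1})^{d_i}$, $d_i = \log c_i / \log b_i$; the two example sequences and the numerical tolerance are arbitrary choices made for illustration.

```python
import math

def build_c_and_d(a, b):
    """Recursive construction from the proof of Proposition 1 (finite prefix)."""
    c = [a[0]]
    d = [math.log(c[0]) / math.log(b[0])]
    for i in range(1, len(a)):
        c_next = max(a[i], b[i] ** d[-1])
        c.append(c_next)
        d.append(math.log(c_next) / math.log(b[i]))
    return c, d

def satisfies_proposition_1(a, c, d, tol=1e-9):
    """Check properties i) and ii) on the prefix, up to floating-point tolerance."""
    d_ok = all(x > 0 for x in d) and all(d[i + 1] <= d[i] + tol for i in range(len(d) - 1))
    c_ok = all(c[i + 1] <= c[i] + tol for i in range(len(c) - 1))
    dominates = all(ci >= ai for ci, ai in zip(c, a))
    return d_ok and c_ok and dominates

# Case where the branch c_i = a_i is always active.
a1 = [1.0 / (i + 2) for i in range(200)]
b1 = [2.0 ** -(i + 1) for i in range(200)]
# Case where the branch c_{i+1} = b_{i+1}^{d_i} takes over after the first step.
a2 = [2.0 ** -((i + 1) ** 2) for i in range(30)]
b2 = [1.0 / (i + 2) for i in range(30)]

for a, b in ((a1, b1), (a2, b2)):
    c, d = build_c_and_d(a, b)
    print(satisfies_proposition_1(a, c, d), c[-1], d[-1])
```

In the second example $a_i$ decays much faster than $b_i$, so $c_i$ follows $b_i^{d_1}$ and the exponents $d_i$ stay constant, which is exactly the case treated at the end of the proof.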


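The next proposition introduces the functions $f^{tr}_\lambda$ and shows some simple results related
to them.

Proposition 2. For any $\lambda > 0$ let the truncated function $f^{tr}_\lambda$ be defined by

(22)  $f^{tr}_\lambda = P_\lambda f_{\mathcal H}$,

where $P_\lambda$ is the orthogonal projector in $L^2(X, \rho_X)$ defined by

(23)  $P_\lambda = \Theta_\lambda(L_K)$,

with

(24)  $\Theta_\lambda(\sigma) = 1$ if $\sigma \ge \lambda$, $\qquad \Theta_\lambda(\sigma) = 0$ if $\sigma < \lambda$.

Then the function $a : (0, \kappa] \to \mathbb R$, defined by

(25)  $a(\lambda) = \|f^{tr}_\lambda - f_{\mathcal H}\|_\rho$,

is non-decreasing and fulfills the following properties

(26)  $0 \le a(\lambda) \le M \qquad \forall \lambda \in (0, \kappa]$,

(27)  $\lim_{\lambda \to 0} a(\lambda) = 0$.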

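Proof. Recall that the self-adjoint integral operator $L_K$ has a countable eigensystem
$\{(\lambda_i, \phi_i)\}_{i=1}^{\infty}$ with positive eigenvalues decreasing to zero (see [5]). Moreover $L_K^{1/2}$ is an
isometry between $L^2(X, \rho_X)$ and $\mathcal H$ (again, see [5]). Therefore, since $f_{\mathcal H}$ is the projection
of $f_\rho$ over the closure of $\mathcal H$ in $L^2(X, \rho_X)$, it holds

$f_{\mathcal H} = \sum_{i=1}^{\infty} \langle f_\rho, \phi_i \rangle_\rho\, \phi_i$.

Hence, by the definition of $f^{tr}_\lambda$, and recalling that $Y \subset [-M, M]$, we get

$0 \le a(\lambda)^2 = \sum_{\lambda_i < \lambda} |\langle f_{\mathcal H}, \phi_i \rangle_\rho|^2 \le \sum_{i=1}^{\infty} |\langle f_\rho, \phi_i \rangle_\rho|^2 \le \|f_\rho\|^2_\rho \le M^2$.

Monotonicity and convergence to zero for $a(\lambda)$ follow from the relation above by standard
arguments on convergent series of positive numbers. □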
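
The identity $a(\lambda)^2 = \sum_{\lambda_i < \lambda} |\langle f_{\mathcal H}, \phi_i \rangle_\rho|^2$ used in the proof lends itself to a direct numerical check. In the Python sketch below the eigenvalues $\lambda_i = i^{-2}$ and the coefficients $\langle f_{\mathcal H}, \phi_i \rangle_\rho = 1/i$ are hypothetical values chosen only to illustrate monotonicity and the limit (27).

```python
import math

def truncation_error(eigenvalues, coefficients, lam):
    """a(lam): norm of the components of f_H along eigenvalues strictly below lam."""
    return math.sqrt(sum(f * f for sigma, f in zip(eigenvalues, coefficients) if sigma < lam))

# Hypothetical spectrum and expansion coefficients of the target function.
eigs = [i ** -2.0 for i in range(1, 51)]
coeffs = [1.0 / i for i in range(1, 51)]

lams = [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 2.0]
values = [truncation_error(eigs, coeffs, lam) for lam in lams]
print(values)
```

The values increase with $\lambda$, vanish once $\lambda$ drops below the smallest eigenvalue retained in this finite example, and never exceed the norm of the full coefficient vector, mirroring (26) and (27).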

The  next  proposition  is  used  in  the  proof  of  Theorem  1. 

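Proposition 3. Let $\bar r$ be a positive number. Then, there exists a function

$R : (0, 1] \to (0, \bar r]$

such that

(28)  $\kappa^{R(\dot\lambda)} \left\| L_K^{-R(\dot\lambda)} P_\lambda f_{\mathcal H} \right\|_\rho \le 4M, \qquad \forall \dot\lambda \in (0, 1]$,

(29)  $\lim_{\dot\lambda \to 0} \dot\lambda^{R(\dot\lambda)} = 0$.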

Proof. Let $\{\lambda_i, \phi_i\}$ be the eigensystem of the positive compact operator $L_K$ (we also use
the shortcut notation $\dot\lambda_i = \kappa^{-1} \lambda_i$). First, if the range of $L_K$ is finite dimensional, the
choice $R(\dot\lambda) = \bar r$ trivially fulfills the required conditions. Second, from definition (25), it
is clear that if the sequence $\{a(\lambda_i)\}_i$ has only a finite number of positive elements, $f_{\mathcal H}$
belongs to the finite dimensional range of the projector $P_\lambda$, for some positive $\lambda$, and the
choice $R(\dot\lambda) = \bar r$ is again a trivial solution.

Therefore in the following we assume $\lambda_i > 0$ and $a(\lambda_i) > 0$ for every $i \in \mathbb N$. Moreover,
from Proposition 5, $\lambda_i \le \kappa$, and by eq. (26), $a(\lambda) \le M$. Hence we can apply Proposition
1 to the non-increasing sequences $\{a_i\}_i$ and $\{b_i\}_i$ defined by


di  = 


d(Ai) 

2  M  ’ 


bt  = 


2k 


The  function  R  is  defined  in  terms  of  the  sequence  {d;}i  constructed  in  Proposition  1 
as  follows 


R(  A) 


Jr  if  Ai  <  A  <  1, 

\  fdi/(r  V  di)  if  Ai+i  <  A  <  A;,  i>  1. 


Equality  (29)  can  be  proved,  recalling  that  by  Proposition  1  a  =  bf !  <  Xf  ‘  goes  to  zero 
as  i  — >  oo,  and  hence 


lim  Ah &  =  lim  Af(A'> 


lim 


((26i)d’) 


r/(rVdi) 


<  2’ 


(  lim  a) 

\i — »oo  / 


r/(rVdi) 


=  o. 
Since by Proposition 1 $\{d_i\}_i$ is a non-increasing sequence of positive reals, $R$ is
non-decreasing. Therefore, defining $f_i = \langle f_{\mathcal H}, \phi_i \rangle_\rho$, we can write the following chain of
inequalities for the quantity

$\kappa^{2R(\dot\lambda)} \left\| L_K^{-R(\dot\lambda)} P_\lambda f_{\mathcal H} \right\|^2_\rho$:

E  f?K2R{k)  <J2f!K2R{Xi) 

Xi>  X  i 

e^t7™ 

i 

E^1 

i 

<  =  2M^/^(Ai)“1 


(fef*  =  Ci  <  l) 


[  hV  =a<  1  j 

(Ci  >  a») 


< 


2ME  E  /*«(a>) 

fe=0  2-fc-1<a(Ai)/M<2-fc 


<  2^2 


fc  +  1 


E 


/i 


1<a(Xi)/M<2~k 


x(A)2  =  E 


<  4M2^2“fc=8M2, 


which  proves  inequality  (28)  and  concludes  the  proof. 


□ 


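We now state four propositions from [2]. The first one introduces the empirical and
ideal least-squares estimators $f^{ls}_{\mathbf z,\lambda}$ and $f^{ls}_\lambda$.

Proposition 4. Assume $\lambda \le \|T\|$ and

(30)  $\lambda m \ge 16 \kappa\, \mathcal N(\lambda) \log^2 \frac{6}{\delta}$,

for some $\delta \in (0, 1)$. Then, with probability greater than $1 - \delta$, it holds

$\left\| (T + \lambda)^{\frac{1}{2}} \left( f^{ls}_{\mathbf z,\lambda} - f^{ls}_\lambda \right) \right\|_{\mathcal H} \le 8 \left( M + \sqrt\kappa\, \|f^{ls}_\lambda\|_{\mathcal H} \right) \left( \frac{2}{m} \sqrt{\frac{\kappa}{\lambda}} + \sqrt{\frac{\mathcal N(\lambda)}{m}} \right) \log\frac{6}{\delta}$,

where

(31)  $f^{ls}_{\mathbf z,\lambda} = (T_{\mathbf x} + \lambda)^{-1} g_{\mathbf z}$,

(32)  $f^{ls}_\lambda = (T + \lambda)^{-1} L_K f_{\mathcal H}$.

Proof. See Proposition 1 in [2]. □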


The second one gives two simple properties of the operator $T$ and the effective dimension
$\mathcal N(\lambda)$.

Proposition 5. For every probability measure $\rho_X$ and $\lambda > 0$, it holds

$\|T\| \le \kappa$,

and

$\lambda\, \mathcal N(\lambda) \le \kappa$.

Proof. See Proposition 2 in [2]. □

The other two propositions from [2] estimate two different terms which appear in the
proofs of Theorems 1 and 2. The symbol $\lfloor x \rfloor$ in the text below represents the greatest
integer less than or equal to $x$.


Proposition 6. Let $f$ belong to $\operatorname{Im} L_K^r$ for some $r > 0$. Then, if $\lambda \in (0, \kappa]$, it holds

I  Vf  (. Ga(T ~)  T*  -  Id)  Pa/||  <  Br  ||Lk7|L  (1  +  Vt)(2  +  n\i~r  +  ^ )Xr , 

II  II  7"£  ” 

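where $P_\lambda$ is defined in eq. (23) and

(33)  $\gamma = \lambda^{-1} \|T - T_{\tilde{\mathbf x}}\|$,

$\eta = r - \frac{1}{2} - \left\lfloor r - \frac{1}{2} \right\rfloor$.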

Proof.  See  Proposition  6  in  [2].  □ 

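Proposition 7. Let the operator $\Omega_\lambda$ be defined by

(34)  $\Omega_\lambda = \sqrt T\, G_\lambda(T_{\tilde{\mathbf x}}) (T_{\tilde{\mathbf x}} + \lambda) (T + \lambda)^{-\frac{1}{2}}$.

Then, if $\lambda \in (0, \kappa]$, it holds

$\|\Omega_\lambda\| \le (1 + 2\sqrt\gamma)\, A$,

with $\gamma$ defined in eq. (33).

Proof. See Proposition 7 in [2]. □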

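We finally need the following probabilistic inequality based on a result of [14]; see also
Th. 3.3.4 of [18]. We report it without proof.

Proposition 8. Let $(\Omega, \mathcal F, P)$ be a probability space and $\xi$ be a random variable on $\Omega$ taking
values in a real separable Hilbert space $\mathcal K$. Assume that there are two positive constants
$H$ and $\sigma$ such that

$\|\xi(\omega)\|_{\mathcal K} \le \frac{H}{2}$ a.s.,

$E\left[\|\xi\|^2_{\mathcal K}\right] \le \sigma^2$;

then, for all $m \in \mathbb N$ and $0 < \delta < 1$,

(35)  $P_{(\omega_1, \dots, \omega_m) \sim P^m} \left[ \left\| \frac{1}{m} \sum_{i=1}^{m} \xi(\omega_i) - E[\xi] \right\|_{\mathcal K} \le 2 \left( \frac{H}{m} + \frac{\sigma}{\sqrt m} \right) \log\frac{2}{\delta} \right] \ge 1 - \delta$.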

We  are  now  ready  to  prove  Theorem  1. 

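Proof of Theorem 1. Let us consider the expansion

$\sqrt T (f_{\tilde{\mathbf z},\lambda} - f_{\mathcal H}) = \sqrt T \left( G_\lambda(T_{\tilde{\mathbf x}})\, g_{\mathbf z} - f^{tr}_\lambda \right) + \sqrt T (f^{tr}_\lambda - f_{\mathcal H})$

$= \Omega_\lambda (T + \lambda)^{\frac{1}{2}} \left( f^{ls}_{\mathbf z,\lambda} - f^{ls}_{\mathbf z',\lambda} \right) + \sqrt T \left( G_\lambda(T_{\tilde{\mathbf x}}) T_{\tilde{\mathbf x}} - \mathrm{Id} \right) f^{tr}_\lambda + \sqrt T (f^{tr}_\lambda - f_{\mathcal H})$

$= \Omega_\lambda \left( (T + \lambda)^{\frac{1}{2}} (f^{ls}_{\mathbf z,\lambda} - f^{ls}_\lambda) + (T + \lambda)^{\frac{1}{2}} (f^{ls}_\lambda - \bar f^{ls}_\lambda) + (T + \lambda)^{\frac{1}{2}} (\bar f^{ls}_\lambda - f^{ls}_{\mathbf z',\lambda}) \right)$
$\quad + \sqrt T \left( G_\lambda(T_{\tilde{\mathbf x}}) T_{\tilde{\mathbf x}} - \mathrm{Id} \right) f^{tr}_\lambda + \sqrt T (f^{tr}_\lambda - f_{\mathcal H})$,

where the operator $\Omega_\lambda$ is defined by equation (34), the ideal RLS estimators are $f^{ls}_\lambda =
(T + \lambda)^{-1} T f_{\mathcal H}$ and $\bar f^{ls}_\lambda = (T + \lambda)^{-1} T f^{tr}_\lambda$, and $f^{ls}_{\mathbf z',\lambda} = (T_{\tilde{\mathbf x}} + \lambda)^{-1} T_{\tilde{\mathbf x}} f^{tr}_\lambda$ is the RLS estimator
constructed by the training set

$\mathbf z' = \left( (\tilde x_1, f^{tr}_\lambda(\tilde x_1)), \dots, (\tilde x_{\tilde m}, f^{tr}_\lambda(\tilde x_{\tilde m})) \right)$.

Hence we get the following decomposition,

(36)  $\|f_{\tilde{\mathbf z},\lambda} - f_{\mathcal H}\|_\rho \le D(\tilde{\mathbf z}, \lambda) \left( S^{ls}(\mathbf z, \lambda) + R(\lambda) + \bar S^{ls}(\tilde{\mathbf z}, \lambda) \right) + P(\tilde{\mathbf z}, \lambda) + P^{tr}(\lambda)$,

with

(37)
$S^{ls}(\mathbf z, \lambda) = \left\| (T + \lambda)^{\frac{1}{2}} (f^{ls}_{\mathbf z,\lambda} - f^{ls}_\lambda) \right\|_{\mathcal H}$,
$\bar S^{ls}(\tilde{\mathbf z}, \lambda) = \left\| (T + \lambda)^{\frac{1}{2}} (f^{ls}_{\mathbf z',\lambda} - \bar f^{ls}_\lambda) \right\|_{\mathcal H}$,
$D(\tilde{\mathbf z}, \lambda) = \|\Omega_\lambda\|$,
$P(\tilde{\mathbf z}, \lambda) = \left\| \sqrt T \left( G_\lambda(T_{\tilde{\mathbf x}}) T_{\tilde{\mathbf x}} - \mathrm{Id} \right) f^{tr}_\lambda \right\|_{\mathcal H}$,
$P^{tr}(\lambda) = \|f^{tr}_\lambda - f_{\mathcal H}\|_\rho$,
$R(\lambda) = \left\| (T + \lambda)^{\frac{1}{2}} (f^{ls}_\lambda - \bar f^{ls}_\lambda) \right\|_{\mathcal H}$.

Terms $S^{ls}$ and $\bar S^{ls}$ will be estimated by Proposition 4, terms $P^{tr}$ and $R$ by Proposition
2, term $D$ by Proposition 7, and finally term $P$ by Propositions 6, 3 and 2.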

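Step 1: Estimate of $S^{ls}$. Since $L_K^{1/2}$ is an isometry between $L^2(X, \rho_X)$ and $\mathcal H$ (see [5]),
we obtain

(38)  $\|f^{ls}_\lambda\|_{\mathcal H} \le \left\| \sqrt T (T + \lambda)^{-1} \right\| \left\| L_K^{1/2} f_{\mathcal H} \right\|_{\mathcal H} \le \frac{\|f_{\mathcal H}\|_\rho}{\sqrt\lambda} \le \frac{M}{\sqrt\lambda}$.

Now, let $\delta$ be an arbitrary real in $(0, 1)$. From the assumptions on $\lambda_m$, for large enough
$m$, we have

$\lambda_m \sqrt m \ge 4 \kappa \log\frac{6}{\delta}, \qquad \lambda_m \le \|T\|$.

Hence, by Proposition 5, for large enough $m$, the assumptions of Proposition 4 are verified,
and we get that with probability greater than $1 - \delta$

$S^{ls}(\mathbf z_m, \lambda_m) \le 8 \left( M + \sqrt\kappa\, \|f^{ls}_{\lambda_m}\|_{\mathcal H} \right) \left( \frac{2}{m} \sqrt{\frac{\kappa}{\lambda_m}} + \sqrt{\frac{\mathcal N(\lambda_m)}{m}} \right) \log\frac{6}{\delta}$

(Prop. 5, eq. (38))  $\le 8M \left( 1 + \sqrt{\frac{\kappa}{\lambda_m}} \right) \left( \frac{2}{m} \sqrt{\frac{\kappa}{\lambda_m}} + \sqrt{\frac{\kappa}{\lambda_m m}} \right) \log\frac{6}{\delta}$

($\lambda_m \le \kappa$, $m \ge 4$)  $\le \frac{32 M \kappa}{\lambda_m \sqrt m} \log\frac{6}{\delta} \to 0$.

Hence it holds

(39)  $\lim_{m \to \infty} S^{ls}(\mathbf z_m, \lambda_m) \overset{P}{=} 0$.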

Step 2: Estimate of $\bar S^{ls}$. This term can be estimated observing that $\mathbf z'$ is a training set
of $\tilde m$ supervised samples drawn i.i.d. from the probability measure $\rho'$ with marginal $\rho_X$
and conditional $\rho'_{|x}(y) = \delta(y - f^{tr}_\lambda(x))$. Therefore the regression function induced by $\rho'$ is
$f_{\rho'} = f^{tr}_\lambda$, and the support of $\rho'$ is included in $X \times [-M', M']$, with $M' = \sup_{x \in X} |f_{\rho'}(x)| \le
\sqrt\kappa\, \|f^{tr}_\lambda\|_{\mathcal H}$. Reasoning as in the analysis of $S^{ls}$, we obtain that, for every $\delta \in (0, 1)$ and
large enough $m$, with probability greater than $1 - \delta$ it holds


Sls(zm,  Am) 

< 

+ 

00 

INL) 

(Prop.  5) 

< 

16 /. 

ll 

«  7^  (2 

(m  >  4) 

< 

32k 

PXr 

nL~JpXm 

\/A  mm 

2 

film 


X  mm 


A7(A„ 


T  )  log 

An 


6  32kM  6 


log 


6 

8 


->  0. 


Hence it holds

(40)  $\lim_{m \to \infty} \bar S^{ls}(\tilde{\mathbf z}_m, \lambda_m) \overset{P}{=} 0$.
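Step 3: Estimate of $P^{tr}$. By definition (25), $P^{tr}(\lambda) = a(\lambda)$. Hence from eq. (27)

(41)  $\lim_{m \to \infty} P^{tr}(\lambda_m) = \lim_{m \to \infty} a(\lambda_m) = \lim_{\lambda \to 0} a(\lambda) = 0$,

where we used the assumption (6).

Step 4: Estimate of $R$. Since from the definitions of $f^{ls}_\lambda$ and $\bar f^{ls}_\lambda$,

$R(\lambda) = \left\| (T + \lambda)^{-\frac{1}{2}} T (f^{tr}_\lambda - f_{\mathcal H}) \right\|_{\mathcal H} \le \left\| \sqrt T (f^{tr}_\lambda - f_{\mathcal H}) \right\|_{\mathcal H} \le P^{tr}(\lambda)$,

from (41) we get

(42)  $\lim_{m \to \infty} R(\lambda_m) = 0$.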

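Step 5: Estimate of $D$. In order to estimate $D(\tilde{\mathbf z}, \lambda)$, we first have to estimate the
quantity $\gamma = \gamma(\tilde{\mathbf z}, \lambda)$ (see definition (33)) appearing in Proposition 7. Our estimate
for $\gamma(\tilde{\mathbf z}, \lambda)$ follows from Proposition 8 applied to the random variable $\xi : X \to \mathcal L_{HS}(\mathcal H)$
defined by

$\xi(x)[\cdot] = \lambda^{-1} K_x \langle K_x, \cdot \rangle_{\mathcal H}$.

We can set $H = \frac{2\kappa}{\lambda}$ and $\sigma = \frac{H}{2}$ and obtain that, for every $\delta \in (0, 1)$ and $\tilde m \ge 4$, with
probability greater than $1 - \delta$

$\gamma(\tilde{\mathbf z}, \lambda) \le \lambda^{-1} \|T - T_{\tilde{\mathbf x}}\|_{HS} \le 2 \left( \frac{H}{\tilde m} + \frac{\sigma}{\sqrt{\tilde m}} \right) \log\frac{2}{\delta} \le \frac{4\kappa}{\lambda \sqrt{\tilde m}} \log\frac{2}{\delta} =: \epsilon(\tilde m, \lambda, \delta)$.

From the expression of $\epsilon(\tilde m, \lambda, \delta)$ we see that, by the assumption (7), for every $\delta \in (0, 1)$,

$\lim_{m \to \infty} \epsilon(\tilde m_m, \lambda_m, \delta) = 0$,

and hence,

(43)  $\lim_{m \to \infty} \gamma(\tilde{\mathbf z}_m, \lambda_m) \overset{P}{=} 0$.

Finally, from eq. (43) and Proposition 7 we find

(44)  $D(\tilde{\mathbf z}_m, \lambda_m) \le \left( 1 + 2\sqrt{\gamma(\tilde{\mathbf z}_m, \lambda_m)} \right) A \overset{P}{\to} A$.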

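Step 6: Estimate of $P$. First, notice that by the definition (3), WLOG we can assume
$\bar r \le \frac{1}{2}$. Moreover by condition (6), we can assume $m$ large enough that $\lambda_m \le \kappa$. We
consider the function $R$ introduced by Proposition 3, and apply Proposition 6, with $f = f^{tr}_{\lambda_m}$
and $r_m = R(\kappa^{-1} \lambda_m) \le \bar r$, getting

$P(\tilde{\mathbf z}_m, \lambda_m) \le B_{\bar r} \left( 1 + \gamma(\tilde{\mathbf z}_m, \lambda_m)^{\frac{1}{2}} \right) \left( 2 + r_m \gamma(\tilde{\mathbf z}_m, \lambda_m) + \gamma(\tilde{\mathbf z}_m, \lambda_m)^{\frac{1}{2} - r_m} \right) \kappa^{r_m} \left\| L_K^{-r_m} P_{\lambda_m} f_{\mathcal H} \right\|_\rho (\kappa^{-1} \lambda_m)^{r_m}$.

This result together with eq. (43), and recalling that by Proposition 3 and assumption (6)
the sequence $\{r_m\}_m$ verifies the two conditions

$\kappa^{r_m} \left\| L_K^{-r_m} P_{\lambda_m} f_{\mathcal H} \right\|_\rho \le 4M \quad \forall m$,

$\lim_{m \to \infty} (\kappa^{-1} \lambda_m)^{r_m} = 0$,

proves that

(45)  $\lim_{m \to \infty} P(\tilde{\mathbf z}_m, \lambda_m) \overset{P}{=} 0$.

The proof of the Theorem is completed considering the limit $m \to \infty$ of estimate (36),
and using equations (39), (40), (41), (42), (44) and (45).
□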
4.2. Before showing the proof of Theorem 2, we state two propositions from [2] which
describe properties of the functions $f^{tr}_\lambda$ and $f^{ls}_\lambda$ (defined in eq. (22) and eq. (32) respectively)
when $f_{\mathcal H} \in \operatorname{Im} L_K^r$.

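Proposition 9. Let $f_{\mathcal H} \in \operatorname{Im} L_K^r$ for some $r > 0$. Then, the following estimates hold,

$\|f^{tr}_\lambda - f_{\mathcal H}\|_\rho \le \lambda^r \|L_K^{-r} f_{\mathcal H}\|_\rho$,

$\|f^{tr}_\lambda\|_{\mathcal H} \le \lambda^{-\frac{1}{2} + r} \|L_K^{-r} f_{\mathcal H}\|_\rho$ if $r \le \frac{1}{2}$, $\qquad \|f^{tr}_\lambda\|_{\mathcal H} \le \kappa^{-\frac{1}{2} + r} \|L_K^{-r} f_{\mathcal H}\|_\rho$ if $r > \frac{1}{2}$.

Proof. See Proposition 3 in [2]. □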

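Proposition 10. Let $f_{\mathcal H} \in \operatorname{Im} L_K^r$ for some $r > 0$. Then, the following estimates hold,

$\|f^{ls}_\lambda - f_{\mathcal H}\|_\rho \le \lambda^r \|L_K^{-r} f_{\mathcal H}\|_\rho$ if $r \le 1$,

$\|f^{ls}_\lambda\|_{\mathcal H} \le \lambda^{-\frac{1}{2} + r} \|L_K^{-r} f_{\mathcal H}\|_\rho$ if $r \le \frac{1}{2}$, $\qquad \|f^{ls}_\lambda\|_{\mathcal H} \le \kappa^{-\frac{1}{2} + r} \|L_K^{-r} f_{\mathcal H}\|_\rho$ if $r > \frac{1}{2}$.

Proof. See Proposition 1 in [2]. □

We are now ready to prove Theorem 2.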

Proof  of  Theorem  2.  We  consider  the  same  decomposition  (see  equations  (36)  and  (37)) 
for  \\fi,x  ~  fn  ||  that  we  used  in  the  proof  of  Theorem  1. 

Terms  Sls  and  Sls  will  be  estimated  by  Proposition  4,  term  D  by  Proposition  7,  term 
P  by  Proposition  6  and  finally  terms  Pu  and  R  by  Proposition  9. 

Let  us  begin  with  the  estimates  of  Sls  and  S  .  First  observe  that,  by  Proposition  5,  it 
holds 

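$\lambda \le \kappa^{-1} \|T\| \le 1,$

therefore, since by assumption $\tilde m \ge m \lambda^{-|2-2r-s|_+ + t_1} \ge m \lambda^{-|1-2r|_+ + t_1}$, we get

$\lambda \tilde m \ge \lambda^{-|1-2r|_+ + 1 + t_1} \, m \ge \lambda^{2r+t_1} \, m.$

Moreover, by eq. (8) and definition (5), we find

$\lambda^{2r+t_1} m = 16 \, q^{2r+s+t_1} D_s^2 \lambda^{-s} \log^2 \frac{6}{\delta} \ge 16 \, \mathcal N(\lambda) \log^2 \frac{6}{\delta},$

hence hypothesis (30) in the statement of Proposition 4 is verified.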
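The exponent manipulations in this step rely only on $\lambda \le 1$ and on elementary inequalities between the exponents. The following sketch checks them on a small grid. It assumes $s \in (0, 1]$ and $t_1 \ge 0$, which is how we read the standing hypotheses; the function names are hypothetical.

```python
def pos(a):
    """Positive part |a|_+ = max(a, 0)."""
    return max(a, 0.0)


def exponent_chain_holds(lam, r, s, t1, rel_tol=1e-12):
    """Check the two monotonicity steps used above for a given lam.

    Step 1: lam^(-|2-2r-s|_+ + t1) >= lam^(-|1-2r|_+ + t1)
    Step 2: lam^(1 - |1-2r|_+ + t1) >= lam^(2r + t1)
    A small relative tolerance absorbs rounding in the equality cases.
    """
    step1 = lam ** (-pos(2 - 2 * r - s) + t1) >= \
        lam ** (-pos(1 - 2 * r) + t1) * (1 - rel_tol)
    step2 = lam ** (1 - pos(1 - 2 * r) + t1) >= \
        lam ** (2 * r + t1) * (1 - rel_tol)
    return step1 and step2


def check_grid():
    """Run the check over an illustrative grid with lam <= 1 and s <= 1."""
    return all(
        exponent_chain_holds(lam, r, s, t1)
        for lam in (1.0, 0.5, 0.1, 0.001)
        for r in (0.25, 0.5, 0.75, 1.5)
        for s in (0.25, 0.5, 1.0)
        for t1 in (0.0, 0.5)
    )
```

For $\lambda > 1$ the first step can fail (for example $\lambda = 2$, $r = 0.25$, $s = 0.5$), which is why the bound $\lambda \le 1$ is recorded first.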

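Regarding the estimate of Sls. Applying Proposition 4 and recalling that by assumption $\tilde m \ge m \lambda^{-|2-2r-s|_+ + t_1} \ge m \lambda^{-|1-2r|_+ + t_1}$ and that, from Proposition 10, $\sqrt{\kappa} \, \| f_\lambda^{ls} \|_{\mathcal H} \le C_r \lambda^{-|\frac12 - r|_+}$, we get that with probability greater than $1 - \delta$

(46)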


Sls(z,  A)  < 


8  (  M  +  \  j  -rCrX 
m 


— 


m 


log  ■ 


<  8(M  +  A_TCr) 


(eq.( 8))  =  2q_r_5_T  (M  +  Cr)Xr  I  1  + 


Da\  6 

°S* 

^  t{!  2r  2s~t,) 

2^+1  +  ^  D?  log  § 


(ti  >  0,  q  >  1)  <  3q  r  2  2  (M  +  Cr)Xr  2 

(q  >  1)  <  3(M  +  a)Ar“T. 
The term Sls can be estimated observing that z is a training set of m supervised samples drawn i.i.d. from the probability measure $\rho'$ with marginal $\rho_X$ and conditional $\rho'_{|x}(y) = \delta(y - f_\lambda^{tr}(x))$. Therefore the regression function induced by $\rho'$ is $f_{\rho'} = f_\lambda^{tr}$, and the support of $\rho'$ is included in $X \times [-M', M']$ with $M' = \sup_{x \in X} |f_{\rho'}(x)| \le \sqrt{\kappa} \, \| f_\lambda^{tr} \|_{\mathcal H}$.

Again applying Proposition 4, we obtain that with probability greater than $1 - \delta$ it holds


(47)  Sls(Z,  A)  < 

< 

{Prop.  9)  < 

< 

(eg-  (8)) 

{ti>0,q>l)  < 

(9  >  1)  < 


log 


6 

<5 


167k  ||  fx  || n 


log 


6 

6 


log 


6 

5 


i6a 


2  ,D,\  6 

°e~s 


4  q 


—  r—  — 


*-^Cr  X 


j(l-2r-2s-ti) 

2qr+i+^i  D1  log  | 


6  Q 


5-TCr  Ar 


*2 

2 


6CrAr“ 


2 


In order to get an upper bound for $D$ and $P$, we first have to estimate the quantity $\gamma = \gamma(z, \lambda)$ (see definition (33)) appearing in Propositions 6 and 7. Our estimate for $\gamma(z, \lambda)$ follows from Proposition 8 applied to the random variable $\xi : X \to \mathcal L_{HS}(\mathcal H)$ defined by

$\xi(x)[\cdot] = \lambda^{-1} K_x \langle K_x, \cdot \rangle_{\mathcal H}.$

We  can  set  H  =  ^  and  o  =  HA  .  and  obtain  that  with  probability  greater  than  1  —  <5 


7(2,A)<A-1||T’-Tx||hs< 


2  / 2k  k  A 

Vrh) 


2  12 
log  T  <  4-r  j=  log  - 
o  Av  m  o 


<  4 


log  y  <  Al1“r-il  +  -(1-r-4)  <  A1'  ’ ‘I  -  <  1, 
m  o 


where  we  used  the  assumption  m  >  4  V  m\  2  2r  sl  +  +tl  and  the  expression  for  A  in  the 
text  of  the  Theorem. 

Hence,  since  7(z,  A)  <  A1  2  from  Proposition  7  we  get 


(48) 

D{  z,A)  <  {1  +  2^)A<3A, 

and  from  Proposition  6 

(49) 

P( 2,  A) 

< 

BrCr{  1  +  ^7)(2  +  r7Ai“r  +  7’7)A’' 

< 

2BrCr(3  +  n\?-r)\r 

< 

2BrCr{3  +  rAlr+i_1l++5“r)Ar 

< 

2BrCr{3  +  r\^)\r  <  2BrCr{3  +  \ 

Regarding terms $P^{tr}$ and $R$. From Proposition 9 we get

(50)  $P^{tr}(\lambda) \le C_r \lambda^{r},$

and hence,
(51)


R(X)  =  |(T+A)-5T(^-/is)|w 

<  ||VT(/aS-/a)||  <Ptr<Cr\r. 


The proof is completed by plugging inequalities (46), (47), (48), (49), (50) and (51) into (36), recalling the expression for $\lambda$. □

5.  Proof  of  Theorem  3 

The  following  result  is  due  to  [13],  adapted  to  a  suitable  form  used  in  this  paper. 

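Proposition 11. Let $\{X_i\}_{i=1}^n$ be a set of real valued i.i.d. random variables with mean $\mu$, $|X_i| \le B$ and $E[(X_i - \mu)^2] \le \sigma^2$, for all $i \in \{1, \dots, n\}$. Then for arbitrary $\alpha > 0$, $\varepsilon > 0$,

(52)  $P\left\{ \frac{1}{n} \sum_{i=1}^n X_i - \mu \ge \alpha \sigma^2 + \varepsilon \right\} \le e^{-\frac{6 n \alpha \varepsilon}{3 + 4 \alpha B}},$

and

(53)  $P\left\{ \mu - \frac{1}{n} \sum_{i=1}^n X_i \ge \alpha \sigma^2 + \varepsilon \right\} \le e^{-\frac{6 n \alpha \varepsilon}{3 + 4 \alpha B}}.$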


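Proof. It suffices to prove the one-sided inequality (52). For any $s > 0$,

$P\left\{ \frac{1}{n} \sum_{i=1}^n X_i - \mu \ge \alpha \sigma^2 + \varepsilon \right\} = P\left\{ e^{\frac{s}{n} \sum_{i=1}^n (X_i - \mu)} \ge e^{s(\alpha \sigma^2 + \varepsilon)} \right\}$

$\le e^{-s\varepsilon - s\alpha\sigma^2} \, E\, e^{\frac{s}{n} \sum_{i=1}^n (X_i - \mu)}$,  by Markov inequality

$= e^{-s\varepsilon - s\alpha\sigma^2} \prod_{i=1}^n E\, e^{\frac{s}{n} (X_i - \mu)}$,  by independence of $X_i$.

Denote $Z_i = X_i - \mu$, $t = s/n$ and $B_1 = 2B$. Thus for those $s$ such that $sB < 3n/2$ (or equivalently $B_1 t / 3 < 1$), using $E Z_i = 0$, $|Z_i| \le B_1$ and $k! \ge 2 \cdot 3^{k-2}$,

$E\, e^{t Z_i} = 1 + \sum_{k=2}^{\infty} \frac{t^k E[Z_i^k]}{k!} \le 1 + \frac{t^2 \sigma^2}{2} \sum_{k=2}^{\infty} \left( \frac{B_1 t}{3} \right)^{k-2} = 1 + \frac{3 t^2 \sigma^2}{6 - 4Bt} \le \exp\left( \frac{3 t^2 \sigma^2}{6 - 4Bt} \right),$

whence

$e^{-s\varepsilon - s\alpha\sigma^2} \prod_{i=1}^n E\, e^{\frac{s}{n}(X_i - \mu)} \le e^{-s\varepsilon - s\alpha\sigma^2} \exp\left\{ \frac{3 s^2 \sigma^2}{n (6 - 4sB/n)} \right\} = e^{-s\varepsilon} \exp\left\{ \frac{s \sigma^2}{n} \left( \frac{3s}{6 - 4sB/n} - n\alpha \right) \right\}.$

Setting $s = s_0 = \frac{6 n \alpha}{3 + 4 \alpha B}$ (one can check that $s_0 B = \frac{6 n \alpha B}{3 + 4 \alpha B} < 3n/2$), we have $\frac{3 s_0}{6 - 4 s_0 B / n} - n\alpha = 0$ and thus r.h.s. $\le e^{-s_0 \varepsilon} = \exp\left( -\frac{6 n \alpha \varepsilon}{3 + 4 \alpha B} \right)$, which gives estimate (52). □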
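The choice of $s_0$ and the resulting bound can be checked numerically. The sketch below verifies that $s_0$ cancels the bracket and stays in the admissible range $s_0 B < 3n/2$, and compares the bound (52) with an exact tail probability for a symmetric two-point distribution on $\{-B, B\}$ (so $\mu = 0$ and $\sigma^2 = B^2$). The test distribution and all function names are illustrative assumptions, not part of the original argument.

```python
import math


def optimal_s(n, alpha, B):
    """The value s_0 = 6 n alpha / (3 + 4 alpha B) used in the proof."""
    return 6.0 * n * alpha / (3.0 + 4.0 * alpha * B)


def bracket(s, n, alpha, B):
    """The quantity 3 s / (6 - 4 s B / n) - n alpha, which vanishes at s_0."""
    return 3.0 * s / (6.0 - 4.0 * s * B / n) - n * alpha


def tail_bound(n, alpha, eps, B):
    """Right-hand side of (52)."""
    return math.exp(-6.0 * n * alpha * eps / (3.0 + 4.0 * alpha * B))


def exact_tail_two_point(n, alpha, eps, B):
    """Exact P{ mean(X) >= alpha * B^2 + eps } for X uniform on {-B, +B}."""
    threshold = alpha * B * B + eps
    total = 0
    for k in range(n + 1):  # k = number of +B outcomes
        if B * (2 * k - n) / n >= threshold:
            total += math.comb(n, k)
    return total / 2 ** n
```

The comparison with the two-point law only illustrates that the bound is respected in one concrete case; it says nothing about how tight the inequality is in general.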


We  are  now  ready  to  prove  Theorem  3. 

Proof of Theorem 3. The strategy of the proof is the following. Define

(54)  $\lambda_m^* = \operatorname{argmin}_{\lambda \in \Lambda_m} \int_Z (T_M f_{z,\lambda}(x) - y)^2 \, d\rho.$

Notice that, since for every $f \in L^2(X, \rho_X)$,

$\int_Z (f(x) - y)^2 \, d\rho = \| f - f_\rho \|_\rho^2 + \int_Z (f_\rho(x) - y)^2 \, d\rho,$

definition (54) of $\lambda_m^*$ is equivalent to

$\lambda_m^* = \operatorname{argmin}_{\lambda \in \Lambda_m} \| T_M f_{z,\lambda} - f_\rho \|_\rho.$
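The decomposition of the expected squared error used here can be verified exactly on a finite toy distribution. In the sketch below, the joint distribution `RHO` on a three-point input space and the helper names are hypothetical; the computation only illustrates the identity, with $f_\rho$ taken as the conditional mean of $y$ given $x$.

```python
# Hypothetical joint distribution rho on a finite X x Y: {(x, y): probability}.
RHO = {
    (0, -1.0): 0.1, (0, 1.0): 0.3,
    (1, 0.0): 0.2, (1, 0.5): 0.2,
    (2, -0.5): 0.2,
}


def marginal(rho):
    """Marginal distribution rho_X on X."""
    px = {}
    for (x, _), p in rho.items():
        px[x] = px.get(x, 0.0) + p
    return px


def regression_function(rho):
    """f_rho(x) = conditional mean of y given x."""
    px = marginal(rho)
    num = {}
    for (x, y), p in rho.items():
        num[x] = num.get(x, 0.0) + y * p
    return {x: num[x] / px[x] for x in px}


def risk(f, rho):
    """Expected squared error: integral of (f(x) - y)^2 with respect to rho."""
    return sum(p * (f[x] - y) ** 2 for (x, y), p in rho.items())


def squared_distance(f, g, rho):
    """Squared L2(X, rho_X) distance between f and g."""
    px = marginal(rho)
    return sum(px[x] * (f[x] - g[x]) ** 2 for x in px)


F_RHO = regression_function(RHO)
```

Because the second term on the right-hand side does not depend on $f$, minimizing the expected error over a family of candidates is the same as minimizing the distance to $f_\rho$, which is exactly how the two definitions of $\lambda_m^*$ are identified.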

Now, from the equality above, the assumption of the Theorem, and recalling that $f_\rho(x) \in [-M, M]$, we get that with probability greater than $1 - \delta$ it holds

(55)  $\| T_M f_{z,\lambda_m^*} - f_\rho \|_\rho \le \| T_M f_{z,\lambda_m} - f_\rho \|_\rho \le \| f_{z,\lambda_m} - f_\rho \|_\rho \le \varepsilon.$


We claim that for every $z$, $\lambda > 0$ and $\delta \in (0, 1)$, with probability greater than $1 - \delta$ over the probability measure $\rho^{m_v}$, it holds

(56)  $\| f_{z^{tot}} - f_\rho \|_\rho^2 \le 2 \, \| T_M f_{z,\lambda_m^*} - f_\rho \|_\rho^2 + \frac{80 M^2}{m_v} \log \frac{2 |\Lambda_m|}{\delta}.$

Estimates (55) and (56) together will complete the proof of the Theorem.

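We now proceed to proving eq. (56). For $i = 1, \dots, m_v$, let us define the random variables

$\xi_i^\lambda = (T_M f_{z,\lambda}(x_i^v) - y_i^v)^2 - (f_\rho(x_i^v) - y_i^v)^2.$

Clearly

$|\xi_i^\lambda| \le 4M^2,$

$E[\xi_i^\lambda] = \int_Z (T_M f_{z,\lambda}(x) - y)^2 \, d\rho - \int_Z (f_\rho(x) - y)^2 \, d\rho = \| T_M f_{z,\lambda} - f_\rho \|_\rho^2,$

$E[(\xi_i^\lambda)^2] = \int_Z (T_M f_{z,\lambda}(x) - f_\rho(x))^2 \, (T_M f_{z,\lambda}(x) + f_\rho(x) - 2y)^2 \, d\rho \le 16 M^2 \, \| T_M f_{z,\lambda} - f_\rho \|_\rho^2 = 16 M^2 E[\xi_i^\lambda].$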


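Hence, using Proposition 11 with $X_i = \xi_i^\lambda$, $\mu = E[\xi^\lambda]$, $B = 4M^2$ and $\sigma^2 = E[(\xi^\lambda)^2] \le 16 M^2 \mu$, we obtain that for all $\lambda \in \Lambda_m$, with probability greater than $1 - \delta$,

$\frac{1}{m_v} \sum_{i=1}^{m_v} \xi_i^\lambda \le (1 + \alpha') E[\xi^\lambda] + \varepsilon,$

and

$E[\xi^\lambda] \le \frac{1}{1 - \alpha'} \left( \frac{1}{m_v} \sum_{i=1}^{m_v} \xi_i^\lambda + \varepsilon \right),$

where $\alpha' = 16 \alpha M^2$ and $\varepsilon = \frac{3 + \alpha'}{6 \alpha m_v} \log \frac{2 |\Lambda_m|}{\delta}$. Therefore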


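$\| f_{z^{tot}} - f_\rho \|_\rho^2 \le \frac{1}{1 - \alpha'} \left( \min_{\lambda \in \Lambda_m} \frac{1}{m_v} \sum_{i=1}^{m_v} \xi_i^\lambda + \varepsilon \right) \le \frac{1}{1 - \alpha'} \left( \frac{1}{m_v} \sum_{i=1}^{m_v} \xi_i^{\lambda_m^*} + \varepsilon \right) \le \frac{1 + \alpha'}{1 - \alpha'} E[\xi^{\lambda_m^*}] + \frac{2}{1 - \alpha'} \varepsilon,$

where the first inequality uses that $f_{z^{tot}}$ is selected by minimizing the empirical validation error over $\Lambda_m$.

Setting $\alpha = 1/(48 M^2)$, this gives $\alpha' = 1/3$ and

$\| f_{z^{tot}} - f_\rho \|_\rho^2 \le 2 \, \| T_M f_{z,\lambda_m^*} - f_\rho \|_\rho^2 + \frac{80 M^2}{m_v} \log \frac{2 |\Lambda_m|}{\delta},$

which proves eq. (56), as desired. □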
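The validation step analyzed in this proof can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses regularized least-squares with a linear kernel $K(x, x') = x x'$ on the real line (so the estimator has a closed form), a geometric grid for $\Lambda_m$, and synthetic data with $|y| \le M = 1$. All names (`fit_rls_linear`, `select_by_validation`, `epsilon`, `validation_penalty`) are hypothetical. The last two functions reproduce the constant arithmetic of the proof.

```python
import math
import random


def truncate(value, M):
    """Truncation operator T_M: clip a value to [-M, M]."""
    return max(-M, min(M, value))


def fit_rls_linear(train, lam):
    """Regularized least-squares with the linear kernel K(x, x') = x * x'.

    Minimizes (1/m) sum (w x_i - y_i)^2 + lam * w^2 over w.
    """
    m = len(train)
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    w = sxy / (sxx + lam * m)
    return lambda x: w * x


def validation_error(f, valid, M):
    """Empirical squared error of the truncated estimator on the validation set."""
    return sum((truncate(f(x), M) - y) ** 2 for x, y in valid) / len(valid)


def select_by_validation(train, valid, grid, M):
    """Pick the grid value of lam with the smallest validation error."""
    errors = {lam: validation_error(fit_rls_linear(train, lam), valid, M)
              for lam in grid}
    best = min(grid, key=lambda lam: errors[lam])
    f_best = fit_rls_linear(train, best)
    return best, (lambda x: truncate(f_best(x), M)), errors


def epsilon(alpha, M, m_v, grid_size, delta):
    """eps = (3 + alpha') / (6 alpha m_v) * log(2 |Lambda_m| / delta)."""
    alpha_prime = 16.0 * alpha * M * M
    return (3.0 + alpha_prime) / (6.0 * alpha * m_v) * math.log(2.0 * grid_size / delta)


def validation_penalty(M, m_v, grid_size, delta):
    """Additive term 80 M^2 / m_v * log(2 |Lambda_m| / delta) in (56)."""
    return 80.0 * M * M / m_v * math.log(2.0 * grid_size / delta)


def make_sample(rng, size):
    """Synthetic data with |y| <= 1: y = 0.8 x + uniform noise in [-0.2, 0.2]."""
    sample = []
    for _ in range(size):
        x = rng.uniform(-1.0, 1.0)
        sample.append((x, 0.8 * x + rng.uniform(-0.2, 0.2)))
    return sample


rng = random.Random(0)
TRAIN, VALID = make_sample(rng, 50), make_sample(rng, 50)
GRID = [2.0 ** (-k) for k in range(10)]
BEST_LAM, F_TOT, ERRORS = select_by_validation(TRAIN, VALID, GRID, M=1.0)
```

With $\alpha = 1/(48 M^2)$ one gets $\alpha' = 1/3$, and $\frac{2}{1-\alpha'} \varepsilon$ coincides with the penalty term of (56). The logarithmic dependence on $|\Lambda_m|$ is what makes a geometric grid cheap in terms of the bound.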